serverless inference
ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
Yifan Sui, Hao Wang, Hanfei Yu, Yitao Hu, Jianxun Li, Hao Wang
Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless platforms can effectively serve general LLM inference but fail with Low-Rank Adaptation (LoRA) inference due to three key limitations: 1) massive parameter redundancy among functions, where 99% of weights are unnecessarily duplicated, 2) costly artifact loading latency beyond LLM loading, and 3) magnified resource contention when serving multiple LoRA LLMs. These inefficiencies lead to massive GPU waste, increased Time-To-First-Token (TTFT), and high monetary costs. We propose ServerlessLoRA, a novel serverless inference system designed for faster and cheaper LoRA LLM serving. ServerlessLoRA enables secure backbone LLM sharing across isolated LoRA functions to reduce redundancy. We design a pre-loading method that fetches comprehensive LoRA artifacts in advance to minimize cold-start latency. Furthermore, ServerlessLoRA employs contention-aware batching and offloading to mitigate GPU resource conflicts during bursty workloads. Experiments on industrial workloads demonstrate that ServerlessLoRA reduces TTFT by up to 86% and cuts monetary costs by up to 89% compared to state-of-the-art LLM inference solutions.
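The backbone sharing the abstract describes works because LoRA keeps the large weight matrix W frozen and adds only a scaled low-rank update, y = Wx + (α/r)·B·A·x, so many adapters can attach to one shared copy of W. A minimal pure-Python sketch of that forward pass (matrix values, dimensions, and the `lora_forward` helper are illustrative, not from the paper):

```python
def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=16, r=2):
    # Frozen backbone output plus the scaled low-rank update B @ (A @ x).
    # W is never modified, which is what allows sharing it across adapters.
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
W = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 frozen backbone weight (shared)
A = [[0.1, 0.2]]               # rank r=1 down-projection (per adapter)
B = [[0.5], [0.5]]             # up-projection (per adapter)
x = [1.0, 1.0]
y = lora_forward(W, A, B, x, alpha=2, r=1)
```

Because each adapter contributes only the small A and B matrices, serving N adapters against one resident backbone avoids the N-fold weight duplication the paper identifies.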
Reduce the time taken to deploy your models to Amazon SageMaker for testing
Data scientists often train their models locally and look for a proper hosting service to deploy their models. Unfortunately, there's no one set mechanism or guide to deploying pre-trained models to the cloud. In this post, we look at deploying trained models to Amazon SageMaker hosting to reduce your deployment time. SageMaker is a fully managed machine learning (ML) service. With SageMaker, you can quickly build and train ML models and directly deploy them into a production-ready hosted environment.
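Before SageMaker hosting can serve a locally trained model, the artifacts must be packaged as a `model.tar.gz` archive with the model files at the archive root, then uploaded to S3 and referenced when creating the model. A minimal packaging sketch using only the standard library (file names and the `package_model` helper are illustrative):

```python
import tarfile
import tempfile
from pathlib import Path

def package_model(artifact_paths, out_path):
    # Bundle trained model files into the model.tar.gz layout
    # SageMaker hosting expects: artifacts at the archive root.
    with tarfile.open(out_path, "w:gz") as tar:
        for p in artifact_paths:
            tar.add(p, arcname=Path(p).name)
    return out_path

# Illustrative usage with a placeholder weights file.
workdir = Path(tempfile.mkdtemp())
weights = workdir / "model.pth"
weights.write_bytes(b"\x00" * 16)  # stand-in for real trained weights
archive = package_model([weights], workdir / "model.tar.gz")
```

The resulting archive would then be uploaded to S3 (for example with boto3's `upload_file`) and passed as the model data location when creating the SageMaker model and endpoint.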
SageMaker Serverless Inference illustrates Amazon's philosophy for ML workloads
We are excited to bring Transform 2022 back in-person July 19 and virtually July 20 - 28. Join AI and data leaders for insightful talks and exciting networking opportunities. Amazon just unveiled Serverless Inference, a new option for SageMaker, its fully managed machine learning (ML) service. The goal for Amazon SageMaker Serverless Inference is to serve use cases with intermittent or infrequent traffic patterns, lowering total cost of ownership (TCO) and making the service easier to use. VentureBeat connected with Bratin Saha, AWS VP of Machine Learning, to discuss where Amazon SageMaker Serverless fits into the big picture of Amazon's machine learning offering and how it affects ease of use and TCO, as well as Amazon's philosophy and process in developing its machine learning portfolio. Inference is the productive phase of ML-powered applications.
Automating machine learning lifecycle with AWS
The machine learning and data science lifecycle involves several phases. Each phase requires complex tasks executed by different teams, as explained by Microsoft in this article. To manage this complexity, cloud providers like Amazon, Microsoft, and Google offer services that automate these tasks and speed up the end-to-end machine learning lifecycle. This article explains the Amazon Web Services (AWS) cloud services used for different tasks in a machine learning lifecycle. For each service, I will give a brief description, a use case, and a link to the documentation. In this article, "machine learning lifecycle" can be read interchangeably with "data science lifecycle."
Deploying ML models using SageMaker Serverless Inference (Preview)
Amazon SageMaker Serverless Inference (Preview) was recently announced at re:Invent 2021 as a new model hosting feature that lets customers serve model predictions without having to explicitly provision compute instances or configure scaling policies to handle traffic variations. Serverless Inference is a new deployment capability that complements SageMaker's existing options for deployment, which include: SageMaker Real-Time Inference for workloads with low latency requirements on the order of milliseconds, SageMaker Batch Transform to run predictions on batches of data, and SageMaker Asynchronous Inference for inferences with large payload sizes or long processing times. Serverless Inference means that you don't need to configure and manage the underlying infrastructure hosting your models. When you host your model on a Serverless Inference endpoint, you simply select the memory size and the maximum number of concurrent invocations. Then, SageMaker will automatically provision, scale, and terminate compute capacity based on the inference request volume.
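The two knobs mentioned above, memory size and maximum concurrency, map onto the `ServerlessConfig` block of a SageMaker endpoint configuration. A hedged sketch that only builds the request payload (the model name is hypothetical, the validation ranges reflect documented limits at the time of writing, and the actual boto3 call is left as a comment):

```python
def serverless_endpoint_config(model_name, memory_mb=2048, max_concurrency=5):
    # Serverless Inference accepts memory sizes from 1024 to 6144 MB
    # in 1024 MB increments; concurrency is capped per endpoint.
    if memory_mb not in range(1024, 6145, 1024):
        raise ValueError("memory_mb must be 1024-6144 in 1024 MB steps")
    if not 1 <= max_concurrency <= 200:
        raise ValueError("max_concurrency must be between 1 and 200")
    return {
        "EndpointConfigName": f"{model_name}-serverless",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": memory_mb,
                "MaxConcurrency": max_concurrency,
            },
        }],
    }

# With AWS credentials configured, the payload would be sent with boto3:
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(**serverless_endpoint_config("my-model"))
cfg = serverless_endpoint_config("my-model", memory_mb=4096, max_concurrency=10)
```

Validating the limits client-side, as sketched here, surfaces configuration mistakes before a round trip to the API.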
AWS Launches SageMaker Studio Lab, Free Tool to Learn and Experiment with Machine Learning
AWS has introduced SageMaker Studio Lab, a free service to help developers learn machine-learning techniques and experiment with the technology. SageMaker Studio Lab provides users with all of the basics to get started, including a JupyterLab IDE, model training on CPUs and GPUs and 15 GB of persistent storage. SageMaker Studio Lab has all the basics to create data analytics, scientific computing, and machine-learning projects with notebooks, which can be easily imported and exported via the Git repo or a private Amazon S3 bucket. SageMaker Studio Lab becomes an alternative to the popular Google Colab environment, providing free CPU/GPU access. Another enhancement for AWS SageMaker is a visual, no-code tool called SageMaker Canvas.
Top 12 AI and machine learning announcements at AWS re:Invent 2021
This week during its re:Invent 2021 conference in Las Vegas, Amazon announced a slew of new AI and machine learning products and updates across its Amazon Web Services (AWS) portfolio. Touching on DevOps, big data, and analytics, among the highlights were a call summarization feature for Amazon Lex and a capability in CodeGuru that helps detect secrets in source code. Amazon's continued embrace of AI comes as enterprises express a willingness to pilot automation technologies in transitioning their businesses online. Fifty-two percent of companies accelerated their AI adoption plans because of the COVID pandemic, according to a PricewaterhouseCoopers study. Meanwhile, Harris Poll found that 55% of companies accelerated their AI strategy in 2020 and 67% expect to further accelerate their strategy in 2021.